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Abstract 



We present an embedding of stochastic optimal control problems, of the 
so called path integral form, into reproducing kernel Hilbert spaces. Using 
consistent, sample based estimates of the embedding leads to a model free, 
non-parametric approach for calculation of an approximate solution to the 
control problem. This formulation admits a decomposition of the problem 
into an invariant and task dependent component. Consequently, we make 
much more efficient use of the sample data compared to previous sample 
based approaches in this domain, e.g., by allowing sample re-use across 
tasks. Numerical examples on test problems, which illustrate the sample 
efficiency, are provided. 



1 Introduction 

While solving general non-linear stochastic optimal control and Reinforcement Learning 
problems remains challenging, some recent work [7] has identified a class of problems that 
admit closed form solutions. Although these solutions require evaluation of a path integral 
- equivalent to evaluation of a partition function, which in itself is a hard problem - they 
allow for the application of Monte Carlo and Variational methods, leading to several practical 
applications, e.g., [15, 1]. In the special case of linear dynamics and quadratic costs, the 
required path integral can be evaluated analytically based on linear operators acting on 
state vectors. Here, we show that, analogously, a suitable embedding of the path integral 
into a reproducing kernel Hilbert space (RKHS) allows it's evaluation in terms of covariancc 
operators acting on elements of the Hilbert space. While this in itself does not yield a 
tractable solution to the SOC problem, consistent estimators of the required operators give 
rise to efficient non-parametric algorithms. 

The change of perspective from the direct estimation of the path integral (which previous 
applications of Monte Carlo methods aimed at) to estimation of operators allows to over- 
come several shortcomings of previous methods while maintaining many of their advantages. 
Most importantly, it can significantly reduce the sample complexity by splitting the problem 
appropriately into an invariant and task varying component, allowing efficient sample re-use 
across tasks and leading to a form of transfer learning - contrast this to the situation where 
any change in the task including, for e.g., different start states, necessitate acquiring new 
samples [15, 16]. Additionally, the approach remains model free, allowing it's application 
to the Reinforcement Learning setting. This is in contrast to variational [9] or function ap- 
proximation [20, 21] approaches, from which it is further distinguished through convergence 
guarantees. The RKHS embedding make the operators state-dimensionality independent, 
leading to better scalability, while prior knowledge about both tasks and dynamics can be 
effectively incorporated by informing choices of sampling procedures and kernel. 
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It is worth noting that, while wc choose to frame our approach in the context of path integral 
stochastic optimal control, it is not restricted to problems which fall into this class. The 
formalisms of linearly solvable MDPs [17], inference control [19] and free energy control [3] 
all require solving an underlying problem of equivalent form, making the methods proposed 
directly applicable in these contexts. Furthermore [10] discusses a formulation which gener- 
alizes path integral control, to derive an optimal policy for general SOC problems. Finally, 
while we focus on finite horizon problems, path integral formulations for discounted and 
average cost infinite horizon problems [18], as well as risk sensitive control [2] also exist. 



2 Path Integral Control 



In this section we briefly review the path integral approach to stochastic optimal control 
[7], for a more detailed treatment, see [8, 15]. Let x e ]R dx be the system state and u € R du 
the control signals. Consider a continuous time stochastic system of the form 

dx = f(x,t)dt + B(x,t)(udt + d£) , (1) 

where d£ is a multivariate Wiener process with E [dS, 2 ] = Q(x,t)dt, and /, B and Q may 
be non-linear functions. In particular note that the system is affine in the controls and 
both noise and controls act in the same subspace. We seek the best Markov policy, i.e., 
u(t) = 7r(x(t),i), with respect to an objective of the form 

C.(X*(T)) + J C(X"(s), s ) + u(s) T Hu( s )d s . (2) 



J n (x.,t) = E x *(t->T)\x 



where T is some given terminal time and the expectation is taken w.r.t. to path of (1) 
starting in x and following policy n. The control cost is further constrained by requiring it 
to satisfy Q = ABH _1 B T for some constant scalar A > 0. 

It can be shown that for problems of this form the optimised objective can be expressed as 

J*(x,f) = min J"(x,t) = -Alog*(x,i) , (3) 

7T 

where ^ is given by the path integral 



tf(x,i) =E Jc o (t _ r) | x \e-^i c ^ x0 ^^ ds ^(X a (T),T) 



(4) 



with *(-,T) = exp{— C.(-)/A}. The expectation in (4) is taken w.r.t. uncontrolled path of 
the dynamics (1), i.e. those under the policy 7r°(-, •) = 0, starting in x t . 

As a consequence of linear control with quadratic control cost and (3), the optimal policy 
7r* (x, t) can be expressed directly in terms of * as 



7r*(x,t) = -H- 1 B(x) r V x J*(x,t) = H-^xf^g^ , 



(5) 



making obtaining ^ the main computational challenge for problems in this class. 

Assuming we arc only interested in the optimal controls at certain time points, say {ii,... lTl } 
with t n = T, it is sufficient to compute the set ^%{x) = ^(x, U) and (4) admits a representa- 
tion in terms of the finite dimensional distribution X = (X°(to), ■ ■ ■ , X°(t n )). Specifically 
using the Markov property of X°(t) and marginalising intermediate states we obtain the 
recursive expression 

*i(x u ) = E Xi+lK [*i(x u ,X i+1 ) ■ . (6) 

Here, 

■i4 i+1 c(x»( s ), s )d 8 



<f>i(x u ,x ti+1 ) =E x o (ti _ 



t i+ i)\x ti ,x H+1 



(7) 



where the expectation is taken w.r.t. uncontrolled path from x t . to x ti+1 . Note that 
— Alog$i can be seen as the (optimal) expected cost for the problem of going from x ti 
to x ti+1 over the time horizon [tj,ij + i] under dynamics and running costs corresponding to 
those of the overall problem given in (2). Hence, the problem naturally decomposes into, 
on the one hand, a set of short horizon problems - or indeed a nested hierarchy of such $ 
- and on the other hand, a set of recursive evaluations backwards in time. 
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3 Embedding of the Path Integral 



We now demonstrate that (6) can be expressed in terms of linear operators in RKHSs. While 
the exposition is necessarily short, [6] provides a more through treatment of the theory of 
RKHSs while [11, 14, 12, 4] provide the basic concepts on which we build. 



3.1 Analytical One Step Path Integral Embedding 

Let H k denote the reproducing kernel Hilbert space of functions Z — >• R associated with the 
positive semi-definite kernel k(-,-). Further, let V z be the set of random variables on Z. 
Following [11], we define the embedding operator £ k : V z — > H k by 

(h,£ k [Z\) =E z [h(Z)] VZeV z ,heH k , (8) 

which constitutes a direct extension of the standard embedding of individual elements z € Z 
into H k used more commonly in the literature. 

In the problem under consideration, the interest lies with the evaluation of vpj given in (6) 
and hence, in a suitable embedding of X i+ i\Xi which would allow the required expectation 
to be expressed as an inner product in some RKHS. Although (8) can be directly applied 
- since for fixed Xi, X i+ i\ Xj, IS cL simple random variable - it is convenient to consider a 
general conditional random variable Z\y as a map y — > V z , yielding random variables over 
Z given a value y € y, and define a conditional embedding operator U lk : V. 1 —¥ H k s.t. 

£ k [Z\y]=U lk o£ l [y] , (9) 

where £ l [y] = l(-,y), i.e., the standard embedding operator of elements y G y used in 
kernel methods. An explicit form of the operator U is given in [14] by means of covariance 
operators, which are generalizations of covariance matrices. Specifically, the uncentered 
covariance operator C ZY for the joint random variable (Z, Y) is given by 

C%y=E {ZtY) [k(Z r )®l(Y,-)] , (10) 

where ® denotes the tensor product. Note that we can see C ZY as an embedding of (Z,Y) 
into the tensor product space H w = H k ® V. 1 , which is the RKHS of the product kernel 
w((z,y),(z',y')) = k(z,z')l(y,y'). Now, under certain technical considerations detailed in 
[4] but beyond the scope of this paper, 

U lk =C z l Y (Cyy)' 1 (11) 

satisfies (9). 

However, as the argument of the expectation, specifically of is not only a function of the 
random variable, i.e., X i+ i, but also of the conditioning Xi, we can not apply (9) directly. 
We proceed by introducing an auxiliary random variable X such that P(X, JQ + i|xj) = 
P(X i+1 \xi)Sx =x . with 6 the delta distribution, hence 



h,£ k 



X i+1 ,X\ Xi 



E X i+ i,X|x 4 



h(X, X J+1 )j = E Xi+1 , Xi [h(*i, X i+1 )} Vhen k . (12) 

Note that treating Xj as constant leads to an alternative formulation. This, although equiv- 
alent in the analytical setting, does however not immediately yield a practical empirical 
estimator as further discussed in the supplementary material. 

Now, assume H^,H^, s.t. e , $ € W', are given 1 . To account for the mismatch 
in the arity of functions in these spaces, may be trivially extended to , a space of 
functions R d " x — > R, using the kernel ip'((u, v), (u', v')) — ip(u, u 1 ), i.e., we identify 
and its tensor product with the RKHS of constant functions. Hence, taking the embedding 
of Xi + \,X\xi into H w = V.^ ® in which the product function of <&j, resides, using 
(6) and further applying (8) and (9) we have 

*i(x) =E Xi+1 | Xi=x [^(x,X i+1 ) • * <+ i(X j+ i)] (13) 



= {$ i ®* i+1 ,S w \X i+1 ,X\Xi = x^ (14) 
= (<S> l ®y i+1 ,U wk o£ k l x }) , (15) 



^.b., H 4, is a space of functions R da = x K da! -> E, while contains functions R dx -»■ : 
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where k is some kernel over R dx of our choosing. As will become apparent in the following 
(see (18)), it is convenient for computational reasons to take k to be ip as it allows for re-use 
of pre-computed matrices over the recursive evaluation of estimates of vp. 

3.2 Finite Sample Estimates 

Evaluation of U - thus also of the path integral embedding (15) - requires evaluation of 
expectations of kernels and remains therefore, in most cases, intractable. However as the 
operators are expressed in terms of expectations, it is straightforward to form empirical 
estimates, leading to practical algorithms. 

First consider the general case, given a set T> = {(z, y)o... m } of i.i.d. samples from (Z, Y). 
An estimate of (10) is given by 

®> = -!>(■> • ( 16 ) 

1=1 

Using the latter in conjunction with (11), a regularized estimate of U lk is given by 

Ug=g k z (G l yy + emI)- 1 g l y , (17) 

where e represents a regularization parameter and g^, G\g represents the vector of em- 
beddings and Gramian respectively, i.e. [g^J, = k(di, •) and [G^ = k(a,i, bj), for given 
sets A, B and kernel k. 

Now, turning to the specific expression of interest, i.e., 'J in (15), we can form an em- 
pirical estimate based on V = {(x,x')i m } sampled i.i.d. from a joint distribution 
P(X',X) = p v o(X'\X)(j,(X), s.t. p n o(X'\X) is the p.d.f. of X l+1 \X, and \x is a free prior. 

Specifically, assume the representation of $^ in is gg/3, which we do not assume to be 
finite dimensional. Then, given a empirical estimate "J^+i = g^a i+ i, based on some set A, 
we obtain the estimate 

where denotes the Hadamard product. The term G^ B f3 takes- assuming without loss of 
generality, $ € "H^- the particularly simple form 

G£ e /3 = $(X, X') = (Qixux'J, $(x u x' 2 ), . . . ) T . (19) 

Hence, obtaining an explicit representation of or indeed choosing H^, is not necessary. 

Importantly, note that is a finite weighted sum of kernels, hence, G H^, which 
directly allows a recursive computation of all ... n and leads, using (5), to an approximate 
optimal policy for fine discretisations of the problem. Furthermore, all required matrices are 
functions of the sample data only and as such can be pre-computed. Finally, the estimator 
is consistent (see supplementary material for proof). While for bounded expected costs, 
convergence of * implies convergence of the estimate of the expected cost (see supplementary 
material), convergence of the latter can be slow for large values due to the log transform, 
leading in practice to poor policies in regions where ^ is small. We would like to emphasize 
that this problem is not limited to the methods proposed here, but is a characteristic of any 
approach based on estimation of <£, e.g., as also noted by [21]. To overcome this problem 
in practice, we form a Laplace approximation to * at a local mode and use it where * is 
small - this corresponds to a local quadratic approximation of the value function, resulting 
in a linear policy which steers the system towards regions of high ^. 

4 Efficient Estimators 

The basic estimator (18) has several drawbacks. For one it has a relatively high computa- 
tional complexity of 0(m 3 ) for the matrix inversion, only required once if the same V is used 
in each time step, and subsequently 0(m 2 ) per iteration. Additionally, sample data under 



*VB> 



( G XX 



emiy 



(18) 
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the uncontrolled dynamics is required, thus not allowing for off-policy learning. To overcome 
these problems two alternative estimators based on weighted samples, which partly address 
these issues, are discussed in the supplementary material. Specifically, the estimator employs 
Gram-Schmidt orthogonalisation, presented previously by [12], which reduces the computa- 
tional complexity to 0(m 2 ) with 0(rh 3 + m 2 m) pre-computations for a chosen ra -C m, and 
a novel importance sampling based estimator. We choose to defer the discussion of these in 
order to address a, in our opinion, often overlooked aspect of efficiency when solving varying 
problems under the same dynamics. In practice, tasks are not performed in isolation, rather 
varying instances of often related problems have to be solved repeatedly, e.g., an optimized 
single reaching movement is of limited use since complex interactions require a series of such 
movements with changing start and target states. Previous approaches generally assume 
re-initialisation for each problem instance, e.g., Monte Carlo methods require novel samples, 
even under such trivial changes as the start state. In the following, we discuss extensions 
to the proposed method which improve sampling efficiency in exactly these cases, allowing 
efficient sample re-use over repeated applications. 



4.1 Transfer Learning via Transition Sample Re-use 

A limitation of the estimator arising in practice is the necessity of evaluating $ at the 
training transitions (cf. (18) and (19)) which, in general, may be infeasible. It is therefore 
desirable to obtain an estimator based on evaluation of $ on a separate, ideally arbitrary, 
data set V . Observe that 

*(©') 

G V = <*> 4>(P, •)> = <*, Cf z (cf z ) 1 cj>(V, •)) « iFG&iG+yv + em'iy'GU , 

where Z is an some free random variable with support on R d * x R d * and we used an empirical 
estimator based on a data set T>' = {(x, x')i m i} of i.i.d. samples from Z (often in practice 
T>' C T>). As indicated evaluation of the r.h.s. only requires evaluation of $ at elements 
of V, hence substituting into (18) gives the desired result. In particular we are now able 
to pre-compute and re-use the inverse matrix of (18) across changing tasks and, assuming 
time stationary dynamics, across different time steps. This is of importance for efficient 
estimation in, e.g., the Reinforcement Learning setting where incurred costs are known only 
at observed transitions or in cases where $ can be freely evaluated but it is expensive to 
do so, while generating large sets of transition samples may be comparatively cheap, e.g., 
the case of simple kinematic control where cost evaluation requires collision detection. Note 
that this form makes explicit use of the kernel 4>, and while we may not be able to guarantee 
$ € H^, by choosing a kernel such that the projection of $ onto is close to $, we can 
expect good results. 



4.2 Task augmented sampling 

We now turn to the question of the sampling distribution. While in general samples are 
required from the task agnostic dynamics X°, a task often induces regularities which suggests 
more suitable sampling distributions. In particular considering the role 3> takes in (18) as a 
weight vector, it appears desirable, akin to importance sampling, to concentrate samples in 
regions of high <f>. Obviously $ can be used to guide the choice of the prior /i (cf. Section 
3.2), however, in the context of repeated tasks we can go further and incorporate <& partly 
into the sampling process allowing, amongst others, for incremental learning of the task. 

Consider the specific situation where one wishes to execute several task instances of a generic 
skill. This situation is often characterised by an invariant cost component relating to the 
skill and a task specific cost component - if one looks at walking as an example, we wish 
to stay balanced in each step but the foot placement target will differ from step to step. 
Formally assume the state cost decomposes as 

C(x, 8, t) = C aW „(x, t) + C task ( X , 0, t) , (20) 

where 9 parameterises the task. In this case, we may write the path integral (4) as 

* = E x „ (t ^ T) | Xt [ e -/ t T i^ sfc (^(t)^,t)^ ( ^(r),T)l , (2i) 
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where the expectation is now taken w.r.t. path of X v , which are modified dynamics which 
absorb the invariant skill component of the cost, i.e., they bias the path dynamics based on 
the Fokkcr-Plank equation 

dtV = _9™L V _ Vx {fv) + V 2 (BB T i/) , (22) 

in other words, the augmented dynamics tends to restrict the solutions to lie on, or at least 
stay close to, some skill space. 

A practical approach for exploiting the induced structure, is to learn the relevant subspacc 
from a few example demonstrations sampled, using e.g. the approach in [5], and sample T> 
on the learned space. Such explicit learning of the space has several advantages; foremost, 
we can use knowledge of the space to choose an appropriate kernel. Also, while C tas k is 
generally well defined by specific objectives we wish to achieve, the skill component often 
takes a more abstract form, e.g. we may desire movements to overall appear 'natural', and 
may only be given implicitly by expert demonstrations of desired movements, in which case 
the proposed framework allows (21) to be used to perform optimal control without explicitly 
referring to the implicit costs. 

5 Experimental Validation 
5.1 Double Slit 

We first consider the double slit problem, previously studied by [7] to demonstrate Monte 
Carlo approaches to path integral control. The problem is sufficiently simple to allow for a 
closed form solution for ^ to be obtained, but complex enough to highlight the shortcomings 
of some previous approaches. The task concerns a particle moving with constant velocity 
in one coordinate, while noise and controls affects it's position in an orthogonal direction. 
The aim is to minimise the square error to a target position at some final time, while also 
avoiding obstacles at some intermediate time, as illustrated in Fig. 1(a). Specifically, the 
one dimensional dynamics are dx = u + d£ and the cost is given by 

_ , . . , 2 , „, flO 4 if t = ^ and x e Obstacle , . 

C.{x) = u{x - xtarget) and C(x, t) = < ^ 2 , (23) 

where uj is a weight. We considered a discretisation with time step 0.02s, i.e. 100 time steps. 

We compare the true_ optimal policy to those obtained using two variants of the proposed 
estimator, ^oc and "Jrx- The latter is based on a Reinforcement learning setting, learning 
from trajectory data without access to the cost, and uses the approach for sample sharing 
across time steps discussed in Section 4.1. Meanwhile, \&oc is based on single transitions 
from uniformly sampled start states and uses knowledge of the cost function to evaluate $ 
in each step. In both cases we use the low rank approximation (see supplementary material) 
and square exponential kernels ip{x, y) — exp{(a; — y) 2 /A} with A set to the median distance 
of the data. For comparison, we also consider two alternative approaches - firstly, the 
trajectory based Monte Carlo approach of [16], using the same number of trajectories as used 
in the Reinforcement Learning setting and on the other hand, a variational approximation, 
specifically a Laplace approximation to the true to obtain a linear approximation of the 
optimal policy. As can be seen in Fig. 1(b), the proposed approach leads to policies which 
significantly improve upon those based on the alternative Monte Carlo approach and which 
are comparable to those obtained from the variational approximation, which however was 
computed based on knowledge of the true <f . In particular, note that the proposed approach 
makes better use of the sample provided, finding a policy which is applicable for varying 
starting positions, as illustrated in Fig. 1(a). As seen from the trajectories in Fig. 1(a), 
the Monte Carlo approach on the other hand fails to capture the multi modality of the 
optimal policy leading to severely impoverished results when applied to starting point B 
without sampling a new data set (cf. Fig. 1(b)). The variational approximation on the 
other hand similarly requires re-computation for each new starting location, without which 
results would also be significantly affected. 
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Figure 1: Results for the double slit problem, (a) Problem setup and mean trajectories 
from policies MC and 'J'rl for two start points are shown. Obstacles and target are shown 
in gray, (b) Empirical expected cost for policies based on various methods for the two start 
states, (c) The true ^ (top) and the estimate "Joe (bottom) based on 10 4 samples, (d) The 
Li error of estimates of ^(-,0) as a function of (transition) sample size, n.b. in case of ^ 
and \&rl data was sampled as 100 step trajectories, for various estimators. 



To illustrate the dependence of the estimate on the sample size we compare in Fig. 1(c) the 
evolution of the L\ error of the estimates of ^ at time t = 0. Sample size refers to total 
number of transition samples seen, hence for ^rl the number of trajectories was the sample 
size divided by 100. In order to also highlight the advantages of the sample re-use afforded 
by the approach in Section 4.1, we also compare with ty, the basic estimator given data of 
the same form as 'I'rl, i-e. recursive application of (18) without sample sharing across time 
steps. 

5.2 Arm Subspace Reaching Task 

We consider reaching tasks on a subspace of the end-effector space of a torque controlled 
5dof arm, simulating constrained tasks such as, for e.g., drawing on a whiteboard or pushing 
objects around on a table. Here the skill component consists of moving with the end-effector 
staying close to a two dimensional task space, while the task instances are given by specific 
reach targets. The task space used is a linear subspace of the end effector space, n.b., hence, 
a non linear subspace of the joint space, and the cost comprises the two components 

C sfcli /(x,<) = cj sfciH ||Jy>(x) - j|| 2 and C tosfc (x, 0) = w tasfe ||y>(x) - 6»|| 2 , (24) 

where </?(•) is the mapping from joint to end-effector coordinates, J & j define the task 
subspace, 9 specifies the reaching target and ui's are weights. We again consider position 
control over a 2s horizon with a 0.02s discretisation. 

This task is challenging for sample based approaches as the low cost trajectories are restricted 
to a small subspace, necessitating large sample sizes to obtain good results for an individual 
reaching target, even if, as suggested in [16] and done here, an inverse dynamics policy is used 
which significantly improves end-effector exploration. However, concentrating on the case 
of changing targets, we exploit the ideas from Section 4.2 by assuming the operators have 
been estimated under the skill augmented dynamics 2 (cf. (21)), and consider subsequent 
learning for a novel task using the estimator from Section 4.1, utilising the already estimated 
operators in two ways. On the one hand, they are directly used in the calculation of on 
the other hand, noting, that as the trajectories are only required to provide T>' , hence do 
not have to be sampled under a specific policy, we use the policy arising when considering 
C skill only, i.e., the skill policy associated with 'J computed using the given operators and 

Ctask{-) = 0. 

2 n.b., while here such a sample is generated explicitly, the more time consuming approach of 
using the importance sample based estimator and collecting a sample under X° could be used 
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(a) (b) (c) 




Number Trajectories 

Figure 2: Results in the reaching task, (a) Training trajectories under the skill augumented 
policy (solid blue) and n° (dashed red) with task space, (b) Illustration of the task setup 
and example trajectories of policies after 100 training trajectories for a set of reaching tasks. 
The black dots show individual reaching targets with the arm shown in it's initial pose, (c) 
The Li error of estimates of ^(-jO) as a function of training trajectories measured with 
respect to an estimate trained on 5000 trajectories. The data point coresponding to #traj 
= is based on the estimate is of ^ taking only C 8 kui into account (see text for details). 

The advantage of sampling under the skill policy is illustrated in Fig. 2(a) where sample 
trajectories under both the skill and null policy are shown, demonstrating that the former 
more effectively explores the the task relevant sub space. Mean trajectories for policies 
learned from 100 trajectories for a set of tasks are illustrated in Fig. 2(b). In Fig. 2(c) we 
plot the L\ error of # as a function of trajectories averaged over ten 9. As the true $ is 
not available for this task we show the error w.r.t. a 'J computed from 5000 trajectories, 
principally to illustrate the rapid convergence of the estimator. 

6 Conclusion 

We have presented a novel approach for solving stochastic optimal control problems which 
are of the path integral control form using Monte Carlo estimates of operators arising from a 
RKHS embedding of the problem, leading to a consistent estimate of ^ . While direct appli- 
cation of Monte Carlo estimation to point evaluation of \E' also yields a consistent estimate, 
it is impractical for computation of controls for anything but simple problems, requiring a 
trajectory sample for each state at which an action is to be computed. Although previous 
work, e.g., [16] and similarly [15], has suggested approaches to overcome the problem of 
sample complexity, these sacrifice consistency in the process and we demonstrate that the 
proposed approach significantly improves upon them in terms of generalization to a policy 
(cf. results in Fig.l(a,b)). We furthermore show that the presented estimators allow for 
sample re-use in situations which previously required an entirely novel sample set. In par- 
ticular we consider transfer in cases where execution of several, potentially related, tasks 
on the same plant is required, demonstrating that it is possible to exploit samples from all 
tasks to learn invariant aspects. 

Note that as $ itself defines a local optimal control problem, an alternative perspective 
on the proposed method is as a principled approach to combining solutions to local control 
problem to solve a more complex large scale problem. In future work we aim to elaborate on 
this interpretation by combining the methods persented with alternative approaches, e.g., 
variational methods, which may be sufficient to provide good estimates for the comparatively 
simpler local problems. 

The choice of kernel has been largely ignored here, but one may expect improved results 
by making informed kernel choices based on prior knowledge about the structure of the 
problem. 
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A Alternative embedding 



As indicated in the main text (cf. section 3.1) an alternative representation to the embedding 
(15) exists. Observe that for the purposes of the expectation the conditioning variable is 
fixed and $ is in fact only a function of it's second argument, making it possible to apply 
(9) embedding X i+ i\xi into the tensor space in which the product of and the partially 
evaluated $ resides. Formally define the operator for partial evaluation on V.^ 

TZ X [h] = h(x,-) Vx e R Dx ,h e (25) 

In particular note that for 1Z X : — > H^* where 4> x — (f>((x, •), (x, •)). We can now write 
<j>i(x, •) = TZ X [$j] € Ti^ and application of (8) and (9) to (6) leads to 

*i(x) = E Xi+1 , Xi=x [R x (X i+1 ) ■ (26) 

= (TZ X [$ 4 ] ® , £ w [X l+1 \X t = x]) (27) 

= (K x [*i]®*i + i,U wk o£ k [x]) (28) 

where "H™ = V.^ ® "H^ and fc is some kernel of our choosing on R Dx which again we take 
to be ip. 

Although (28) is formally equivalent to the embedding derived in the main text, i.e., (15), 
the yield different empirical estimates. Specifically, applying (17) to the (28) we obtain 
*i(x) - G^a(x) with 

a(x) = [G+^P G^,o/] T [G xx + enl)- 1 (29) 

Hence, although this approach allows us to evaluate "J^ at specific points, we do not directly 
obtain a a finite dimensional representation of in some RKHS. Furthermore, due to the 
dependence on the evaluation point, the Gram matrix G^ xX ^ K can in general not be pre 
computed. None the less this form may have it's applications for a forward, backwards 
algorithm where G( xA ^ K is used for selection of an active set X for which a's are computed 
in a backwards pass. 

B Alternative Estimators 

We now discuss the two alternative estimators based on weighted samples alluded to in the 
main text (cf. section 4). 

B.l Low rank Approximation 

First, we address the computational complexity of (18), which is C(m 3 ) for the matrix in- 
version, which may be precomputed, and 0(m 2 ) for subsequent computations. Although 
such costs are acceptable for reasonably sized problems, they may prove prohibitive for 
application to realistic robotic systems. However we can apply a Gram-Schmidt orthogo- 
nalisation of g x , as proposed by [12]. Summarising we approximate g x ss g^W^ and 
Ex 1 ~ S x ,Wx' , where y C X, y' C X' and W X ,W X / are weight matrices. Substituting 
into (18) we may then obtain the alternative estimator 

Gt s/ 3 G^a J+1 ] T W^Wj (w,Wj + f mG*/ 1 ) 'gJ/' (30) 

This is computationally advantageous as with \y\ = rh <C m the complexity reduces to 
C(m 3 + rh 2 m) and 0(rh 2 ) for required pre-computations and per iteration respectively, 
often with minimal effects on the obtained results. 

B.2 Importance Sampling 

The estimator (18) is based on a sample from the distribution p^o (X'\X)fj,(X) and while we 
are free to choose /j,, p^o is specified by X i+ i\Xi, i.e. the uncontrolled dynamics. In practice 
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it may be impractical to sample according to the uncontrolled dynamics, e.g., we may wish to 
improve the policy sequentially collecting new sample following the already learned, rather 
then the uninformed, policy. To address such situation we follow the importance sampling 
approach. Specifically note that 



nkl 



E(z',y) 



pi 71 y>\ 



(31) 



where P, Q are the p.d.f.s of the two joints (Z,Y), (Z' ,Y') and we assume Q(z,y) = => 
P(z,y) = 0. Hence given a i.i.d. sample from (Z',Y r ) and empirical estimate of C^ Y is 
given by 



Cv = ^2wik(-,Zi) ®l(-,yi) , with w, = P{z u yi)/Q(z i7 yi) . 



(32) 



Applying these to (11) to obtain an empirical estimate of U, it is easy to show that the 
based on a sample from p n (X'\X)fj,(X), formed from an alternative policy, the estimator 
= g%ai with 

1 T 



OLi 



G VB P G x , A a i+1 



W{G k xx +enI)- 1 



(33) 



is obtained, where W is the diagonal weight matrix with W ri = p 7V o(x' i \xi)/p 7r (x' i \xi) and 
we again assume that p n (x'\x) = =>■ p n o(x'\x) = 0. 

C Proofs and Derivations 



C.l General Results for Path Integral Control 

Theorem 1. Let the optimal value function be bounded, say J*(-,t) < c then, 

||*(-, t) - *(-,t)||oo => \\J*(;t) - J*(;t)\\oo 



(34) 



Proof. From (3), 
Now 



J*(;t) <c^ *(-,*) > c' > 
\J*(;t) - J*(;t)\\oc = SU P |J*(x,t) - J*(x,t)| 

X 

*(x,t). 



=A sup I log 



- Asu x p|log( fey 

<Asu P |log( ^^-^^ + l)| 

x C 



and thus 



(35) 

(36) 
(37) 
(38) 
(39) 

(40) 

□ 



C.2 Convergence of Estimates 

Theorem 2. Under the assumptions in the main text, the assumptions of lemma 3 and 
assuming all relevant kernels satisfy < k(x,x') < 1, the estimator is consistent, i.e., 
\\^i — ^i\\n converges in probability. 
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Proof. Let ^ = U* [<& <g> where is the adjoint of i.e., ^ captures the approx- 

imation arising due to the empirical embedding. Then using general relation |jT/i||-w < 
HT'lWI'illw — ll^ll-f/sll^llw f° r bounds on operators, we have the bound 

- *i\\ n =\\U* [$ ® - W [* ® ||„ (41) 
<||$®* i+ i|| w ||W* (42) 
=||*|| W ||*i + i|kl|W*-W*|| HS (43) 

S v ' 

Now 

- <||*< - + - (44) 

<||H* [$ ® - Z? [* ® ||« + ||$|| wC j (45) 

<||$ ® - $ ® * <+1 || w ||«*|| 2 + (46) 

<||$|| w ||*i+i - + ||*||we< (47) 

where in the last line we used < k(x,x') < 1 => ||W*||2 < 1- As we may further using 
lemma 3 and the union bound construct e s.t. with probaility 1 — S simultaniously for all 
£u e» < e - The result then follows by induction. □ 

C.3 Auxiliary Results 

_ 3 

Lemma 3 (Song et al., 2010). Assume the operator Cy xC xx is Hilbert- Schmidt, then 

\\U-U y \x\\hs = C(A5 +A-5to-5) (48) 

In particular if the regularization term A satisfies A — » and m\ 3 — » oo, f/iera — W /,c ||_f/s 
converges in probability. 

Proof. Sec [13] Theorem 1. □ 
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